HiveMC is an official Minecraft Java Edition server. Minecraft is a sandbox video game developed by Mojang Studios. The game was created by Markus "Notch" Persson in the Java programming language and released as a public alpha for personal computers in 2009. The full release followed in November 2011, with Jens Bergensten taking over development. Minecraft Java Edition has since surpassed 30 million lifetime sales.
HiveMC, or The Hive, is a minigames server owned by Hive Games Limited. It was first registered in 2018. It focuses on providing players with fun games such as SkyWars, Bedwars, Hide and Seek and DeathRun. Over 12 million unique Minecraft accounts have visited the server at least once.
Bedwars is a strategy-based, player-versus-player minigame in which you must protect your bed while trying to eliminate your opponents on islands in the sky. You can continue to respawn while your bed is safe. If your bed is destroyed, you can no longer respawn, and you are eliminated once you die. Before it was first released on multiplayer servers, it was a custom game map, developed in 2012 to be played with a group of friends. At the time it was called "Rush" and was developed by Xisuma. It became very popular once it was released on GommeHD, a German minigames server, where the name was changed to "Bedwars".
As of 2020, Bedwars is one of the most popular Minecraft minigames. The average number of unique players per day for this game mode is over 10,000. I also play Bedwars, and as of this date I am in 492nd place among all 3.6 million Hive Bedwars players.
This project analyzes the top 1000 Bedwars players of The Hive.
The data consists of player statistics, namely the indicators of a player's progress in the game. In this analysis we are interested in comparing those statistics across players. Later in the project we will also compare the countries the players come from. We took a sample of the top 1000 players because those players spent a lot of time getting there. This also helps to avoid counting several accounts of the same player, which would obviously bias the statistics.
Our analysis is based on data from 01.10.2020 obtained from the HiveMC API (https://api.hivemc.com/). We will additionally scrape information about each player's country. Below are the main attributes of the data that will be obtained, scraped and used for our analysis:
For this project, the data analysis and visualization consists of 5 parts:
* Points can be obtained by killing players (5 points), breaking beds (50-80 points) and upgrading resource generators (5-15 points).
We are going to use the BeautifulSoup package to get the data from the HiveMC API and NameMC. To get our data we need to make 5 API requests and scrape 1000 pages. The algorithm works as follows: first we get the data of the first 200 players of the leaderboard from the Hive API as a JSON string, which is then converted to an object. Using the usernames, we then go through the corresponding 200 NameMC pages to scrape each player's country. Once obtained, all of the data is written to a CSV file for further use. The process is repeated for the remaining 4 API requests. The scrape took about 4 hours.
Source for the web scraping: https://ru.namemc.com/
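The paging scheme described above can be sketched without sending any request. This is only an illustration: the URL pattern is the one used in this project's scraping code, while the helper name `leaderboard_urls` and the constants are hypothetical.

```python
# Sketch: build the paginated leaderboard URLs (no network access needed).
BASE_URL = "https://api.hivemc.com/v1/game/BED/leaderboard/"  # pattern used in this project
PAGE_SIZE = 200   # at most 200 players per request, as described in the text
TOTAL = 1000      # size of the sample


def leaderboard_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one (start, end, url) tuple per API request (hypothetical helper)."""
    pages = []
    for start in range(0, total, page_size):
        end = min(start + page_size, total)
        pages.append((start, end, f"{BASE_URL}{start}/{end}"))
    return pages


for start, end, url in leaderboard_urls():
    print(f"places {start + 1}-{end}: {url}")
```

Keeping the URL construction separate from the requests makes it easy to check that the five ranges cover places 1 to 1000 before starting a scrape that runs for hours.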
#import all the needed packages
import requests
from bs4 import BeautifulSoup
import json
import time
import csv
import numpy as np
# Import matplotlib
import matplotlib.pyplot as plt
# Import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
The next part of the code will not be run, since it takes too much time and rewrites the existing file. You can run it yourself if you would like to. The code is also included in the GitHub repo. Changing the proxy every 5 scrapes from NameMC could also speed up the process without losing data.
#The scrape took about 4 hours because of the delay needed between NameMC requests
# with open('hive_players.csv', mode='w', newline='') as hive_players:
#     fieldnames = ['Place', 'Username', 'Points' , 'Victories', 'Games', 'Kills', 'Deaths', 'Beds', 'Teams Eliminated', 'Winstreak', 'Country']
#     player = csv.DictWriter(hive_players, fieldnames=fieldnames)
#     player.writeheader()
#     for j in range (0, 5): #j iterates through the API requests
#         #we need 5 API requests because one request returns at most 200 players
#         k = j*200 #start place of the players in this request
#         m = k+200 #end point of the request
#         URL = 'https://api.hivemc.com/v1/game/BED/leaderboard/'
#         URL = URL + str(k) + '/' + str(m)
#         page = requests.get(URL)
#         soup = BeautifulSoup(page.text, 'html.parser')
#         y = json.loads(soup.prettify())
#         f = 0 #counts NameMC requests since the last pause
#         for i in range(0, 200):
#             #the index i restarts with each API request
#             if f == 5:
#                 #needed because NameMC allows only a limited number of requests per time window
#                 time.sleep(60)
#                 f = 0
#             URL2 = 'https://ru.namemc.com/profile/' + y["leaderboard"][i]["username"]
#             page2 = requests.get(URL2)
#             f += 1 #assumed fix: the counter was never incremented, so the pause could not trigger
#             sup = BeautifulSoup(page2.text, 'html.parser')
#             country = None
#             #path to the needed HTML element
#             a1 = sup.find_all('div', class_ ='row')
#             if len(a1)>=2 : #a1[1] is used below, so at least two rows are required
#                 #just to verify that we are on the right track
#                 a2 = a1[1].find_all('div', class_='col-lg-8')
#                 if len(a2)>=1 :
#                     a3 = a2[0].find_all('div', class_='card mb-3')
#                     if len(a3)>=3 :
#                         a4 = a3[2].find_all('div', class_='card-body py-1')
#                         if len(a4)>=1 :
#                             a5 = a4[0].find_all('div', class_='row')
#                             if len(a5)>=1 :
#                                 for itter in range (0, len(a5)) : #iterating through this card to find the right element
#                                     if a5[itter].find('div', class_='col-md-4').text == "Страна" : #"Страна" is the label for "Country" on ru.namemc.com
#                                         country = a5[itter].find('div', class_='col-auto').text
#             #writing data to csv file...
#             player.writerow({'Place' : j*200 + i + 1, 'Username' : y["leaderboard"][i]["username"], 'Points' : y["leaderboard"][i]["total_points"], 'Victories' : y["leaderboard"][i]["victories"], 'Games' : y["leaderboard"][i]["games_played"], 'Kills' : y["leaderboard"][i]["kills"], 'Deaths' : y["leaderboard"][i]["deaths"], 'Beds' : y["leaderboard"][i]["beds_destroyed"], 'Teams Eliminated' : y["leaderboard"][i]["teams_eliminated"], 'Winstreak' : y["leaderboard"][i]["win_streak"], 'Country' : country})
#Now in the pandas form:
import pandas as pd
table = pd.read_csv('hive_players.csv')
table[:10]
At this step we need to reshape our data a bit and assign appropriate column names.
#Missing values are given by "None" so we need to replace them
table.replace("None", np.nan, inplace = True)
# some columns with numeric values should be converted to int type
table = table.astype({"Points": "int", "Victories": "int", "Games": "int", "Kills": "int", "Deaths": "int", "Beds": "int", "Teams Eliminated": "int", "Winstreak": "int"})
table.head(5)
table.tail(5)
#The number of points is pretty large, more than 500 thousand for each player
#We need to convert this column: let's divide by a thousand and round to 1 decimal place
points = table["Points"] #save that for later
table["Points"] = table["Points"].div(1000).round(1)
table.head(5)
#Now we need to rename Points column so that it represents correct info
table.rename(columns = {'Points' : 'Points (in thousands)'}, inplace = True)
table.head(5)
#Let's determine how many values are missing
table_validation = pd.DataFrame()
table_validation["Columns"] = list(table.columns)
table_validation["Count"] = list(table.count())
table_validation[:]
plt.figure(figsize=(18,6))
not_enough = [i if i < 1000 else 0 for i in table_validation["Count"]]
plt.bar(table_validation["Columns"], table_validation["Count"])
plt.bar(table_validation["Columns"], not_enough, color = 'gray')#color to indicate those columns where there is not enough data
plt.xticks(rotation = 90, fontsize = 12)
plt.show()
So the only missing values are the countries of some players, either because those players are not registered on NameMC or because they do not share their country. Entries with no country data make up the dominant share. We will not use them later in the geographical analysis.
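The same check can be done numerically. The following is a minimal sketch on an invented toy table (the usernames, values and the `toy` name are placeholders, not the scraped data) that counts missing entries per column and keeps only rows with a known country.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped table; all values are invented.
toy = pd.DataFrame({
    "Username": ["a", "b", "c", "d"],
    "Kills": [10, 20, 30, 40],
    "Country": ["Germany", np.nan, np.nan, "Canada"],
})

missing = toy.isna().sum()               # number of missing entries per column
share = (missing / len(toy)).round(2)    # fraction of rows with a missing entry
print(pd.DataFrame({"Missing": missing, "Share": share}))

# Rows that can be used in a per-country analysis
with_country = toy.dropna(subset=["Country"])
print(len(with_country), "rows with a known country")
```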
Before we begin analyzing the ratios, let's look at the plots of kills and deaths individually.
plt.figure(figsize=(18,6))
#creating plot for kills
plt.subplot(121)
plt.plot(table['Kills'])
plt.xlabel('Place')
plt.ylabel('Kill count')
plt.title('Figure 4.1 Number of kills of the player', fontsize = 16)
#creating plot for deaths
plt.subplot(122)
plt.plot(table['Deaths'])
plt.xlabel('Place')
plt.ylabel('Death count')
plt.title('Figure 4.2 Number of deaths of the player', fontsize = 16)
plt.show()
Now let's build a kill/death ratio (KDR) table for further analysis.
kd_rt = pd.DataFrame() #separate dataframe to see the table
kd_rt['Place'] = table['Place'] #we use place, username of the player
kd_rt['Username'] = table['Username']
kd_rt['Kills/Deaths'] = table['Kills'].div(table['Deaths'])
kd_rt.head(10)
kd_rt['Kills/Deaths'].describe() #some descriptive statistics
kd_rt.sort_values(by=['Kills/Deaths'], ascending = False) #sort to see the best and worst players
plt.figure(figsize=(12,6)) #creating a histogram
plt.hist(kd_rt['Kills/Deaths'], 100, density=True, facecolor='b', alpha=0.75)
plt.xlabel('Kills/Deaths')
plt.ylabel('Probability density') #with density=True the bar heights are densities, not probabilities
plt.title('Figure 4.3 Histogram of Kills/Deaths ratio' , fontsize = 16)
plt.text(11, .45, r'$\mu=2.337,\ \sigma=1.484$', fontsize = 16)
plt.grid(True)
plt.show()
print('The value of skewness is: '+ str(kd_rt['Kills/Deaths'].skew().round(3))) #to calculate value of skewness
So we can clearly see from Figure 4.3 and the skewness value that the Kills/Deaths ratio is positively skewed. The best player by KDR is LOWIQQ and the worst is _XaviWxlf.
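To make the notion of positive skew concrete, here is a small sketch on invented kill and death counts (not the real data). A single very large ratio pulls the mean above the median, and `Series.skew()` returns a positive value.

```python
import pandas as pd

# Invented kill and death counts, used only to illustrate the calculation.
toy = pd.DataFrame({
    "Kills": [100, 150, 200, 250, 2000],
    "Deaths": [100, 100, 100, 100, 100],
})
kdr = toy["Kills"] / toy["Deaths"]   # ratios 1.0, 1.5, 2.0, 2.5 and 20.0

print("mean:", kdr.mean())           # pulled up by the single large value
print("median:", kdr.median())
print("skewness:", round(kdr.skew(), 3))
```

A mean well above the median together with a positive skewness value is the pattern visible in the histogram: most players sit near the low end and a few have very high ratios.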
Now let's do the same for wins and losses, using the wins/losses ratio (WLR).
plt.figure(figsize=(18,6))
#first plot
plt.subplot(121)
plt.plot(table['Victories'], color ='g')
plt.xlabel('Place')
plt.ylabel('Win count')
plt.title('Figure 4.4 Number of wins of the player', fontsize = 16)
#second plot
losses = table['Games'].sub(table['Victories'], axis = 0)
plt.subplot(122)
plt.plot(losses, color ='g')
plt.xlabel('Place')
plt.ylabel('Loss count')
plt.title('Figure 4.5 Number of losses of the player', fontsize = 16)
plt.show()
wl_rt = pd.DataFrame() #separate dataframe creation
wl_rt['Place'] = table['Place']
wl_rt['Username'] = table['Username']
wl_rt['Wins/Losses'] = table['Victories'].div(losses)
wl_rt.head(10)
wl_rt['Wins/Losses'].describe() #descriptive statistics
wl_rt.sort_values(by=['Wins/Losses'], ascending = False) #best and worst players...
plt.figure(figsize=(12,6)) #creating a histogram
plt.hist(wl_rt['Wins/Losses'], 100, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Wins/Losses')
plt.ylabel('Probability density') #with density=True the bar heights are densities, not probabilities
plt.title('Figure 4.6 Histogram of Wins/Losses ratio' , fontsize = 16)
plt.text(125, .25, r'$\mu=3.139,\ \sigma=9.539$', fontsize = 16)
plt.grid(True)
plt.show()
print('The value of skewness is: '+ str(wl_rt['Wins/Losses'].skew().round(3)))
It is clear that the win/loss ratio is positively skewed. We can also see something irregular: the player 'TryMeHacker' has a negative WLR, which should be impossible. This is due to an in-game bug (already fixed) that the player used to get an impossible number of wins. We should exclude this player from further analysis.
The player with the highest WLR is tastydish. BbyTiger has the lowest WLR.
table = table[table['Username'] != 'TryMeHacker'] #excluding TryMeHacker from the table
kd_rt = kd_rt[kd_rt['Username'] != 'TryMeHacker']
wl_rt = wl_rt[wl_rt['Username'] != 'TryMeHacker']
table[295:297]
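Since losses are computed as games minus victories, one way a negative WLR can arise is a victory count that exceeds the number of games played. Assuming that is what happened here (the raw values are not shown), such rows could be detected by a rule instead of by username. The sketch below uses invented rows.

```python
import pandas as pd

# Invented rows; "glitched" mimics an account whose wins exceed its games.
toy = pd.DataFrame({
    "Username": ["alpha", "beta", "glitched"],
    "Games": [100, 80, 50],
    "Victories": [60, 20, 70],
})
toy["Losses"] = toy["Games"] - toy["Victories"]
toy["WLR"] = toy["Victories"] / toy["Losses"]

# A loss count below zero cannot come from a real match history.
invalid = toy[toy["Losses"] < 0]
clean = toy[toy["Losses"] >= 0]
print(invalid["Username"].tolist())
```

Filtering on the rule rather than on a name would also catch any other account affected by the same bug.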
Next we will analyze the correlation between the two ratios.
kd_wl = pd.concat([kd_rt, wl_rt['Wins/Losses']], axis = 1, join='inner')
kd_wl[:]
plt.figure(figsize=(8,8))
x1 = kd_wl['Kills/Deaths']
y1 = kd_wl['Wins/Losses']
plt.scatter(x1, y1, label=f'Correlation coefficient = {np.round(np.corrcoef(x1,y1)[0,1], 3)}') #correlation coef for data
plt.title('Figure 4.7 Correlation between Kills/Deaths and Wins/Losses')
plt.xlabel('Kills/Deaths')
plt.ylabel('Wins/Losses')
plt.legend(prop={'size': 11})
plt.show()
Figure 4.7 shows that there is a weak positive linear correlation between KDR and WLR.
Distribution analysis
This shows that players generally tend to have a higher WLR than KDR. The WLR also has a large number of outliers, players with extreme values far above the mean, which greatly inflates the variance of the data.
Generally, kills, deaths and wins show a logarithmic trend with respect to the player's ranking by points, while losses do not. This might be because lost games do not play a big role in earning in-game points.
table.head(10)
For the following analysis we will look at the relationship between the place (rank) of the player and points, beds, teams eliminated and winstreak. We will not look at kills, deaths and victories, as we already discussed them in the previous part. We will also not look at the number of games for now, as it will be covered in another part.
#before we start it's better to divide the players into rank groups
#slice ends are exclusive, so 0:25 covers places 1-25, 25:100 covers places 26-100, and so on
top25=table[0:25]
top100=table[25:100]
top250=table[100:250]
top1000=table[250:]
fig = plt.figure(figsize=(18,15))
fig.suptitle('Figure 4.8 Scatterplots, dependence on Ranking', fontsize=16)
ax1 = fig.add_subplot(421) #axes1
ax1.scatter(x=top25['Place'],
y=top25['Points (in thousands)'],
color = 'g')
ax1.scatter(x=top100['Place'],
y=top100['Points (in thousands)'],
color = 'y')
ax1.scatter(x=top250['Place'],
y=top250['Points (in thousands)'],
color = 'b')
ax1.scatter(x=top1000['Place'],
y=top1000['Points (in thousands)'],
color = 'r')
ax1.set_ylabel('Points (in thousands)', fontsize=15)
ax1.set_xlabel('Place', fontsize=15)
ax2 = fig.add_subplot(422) #axes2
ax2.scatter(x=top25['Place'],
y=top25['Beds'],
color = 'g')
ax2.scatter(x=top100['Place'],
y=top100['Beds'],
color = 'y')
ax2.scatter(x=top250['Place'],
y=top250['Beds'],
color = 'b')
ax2.scatter(x=top1000['Place'],
y=top1000['Beds'],
color = 'r')
ax2.set_ylabel('Beds', fontsize=15)
ax2.set_xlabel('Place', fontsize=15)
ax3 = fig.add_subplot(423) #axes3
ax3.scatter(x=top25['Place'],
y=top25['Teams Eliminated'],
color = 'g')
ax3.scatter(x=top100['Place'],
y=top100['Teams Eliminated'],
color = 'y')
ax3.scatter(x=top250['Place'],
y=top250['Teams Eliminated'],
color = 'b')
ax3.scatter(x=top1000['Place'],
y=top1000['Teams Eliminated'],
color = 'r')
ax3.set_ylabel('Teams Eliminated', fontsize=15)
ax3.set_xlabel('Place', fontsize=15)
ax4 = fig.add_subplot(424) #axes4
ax4.scatter(x=top25['Place'],
y=top25['Winstreak'],
color = 'g')
ax4.scatter(x=top100['Place'],
y=top100['Winstreak'],
color = 'y')
ax4.scatter(x=top250['Place'],
y=top250['Winstreak'],
color = 'b')
ax4.scatter(x=top1000['Place'],
y=top1000['Winstreak'],
color = 'r')
ax4.set_ylabel('Winstreak', fontsize=15)
ax4.set_xlabel('Place', fontsize=15)
fig.legend(['1-25 rank', '26-100 rank', '101-250 rank', '251-1000 rank'], prop={'size': 15})
plt.show()
As we can see from Figure 4.8, points, beds and teams eliminated clearly follow a pattern close to a logarithmic curve, with points being the closest to that shape. For every statistic except winstreak, players with a better place have higher values. We will examine the trends in winstreak further in the analysis.
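The four rank groups used in these plots can also be derived directly from the `Place` column. The sketch below is one possible approach, not the method used in this notebook: it applies `pd.cut` with right-inclusive bins to a few invented boundary places.

```python
import pandas as pd

# Boundary places chosen to show where each group starts and ends.
places = pd.DataFrame({"Place": [1, 25, 26, 100, 101, 250, 251, 1000]})

# Right-inclusive bins: (0, 25], (25, 100], (100, 250], (250, 1000]
places["Rank group"] = pd.cut(
    places["Place"],
    bins=[0, 25, 100, 250, 1000],
    labels=["1-25", "26-100", "101-250", "251-1000"],
)
print(places)
```

Grouping by place value rather than by row position keeps the groups correct even after rows have been removed from the table.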
Not all of the players in the top 1000 have a winstreak, so for this part we will look only at players with a winstreak of at least 1.
table.sort_values(by = ['Winstreak'], ascending = False).head(10) #just sorted
table.loc[table['Winstreak']>0].sort_values(by = ['Winstreak'], ascending = False) #use this to determine how many players have winstreak
So we can see that 479 players have a winstreak greater than 0.
table['Winstreak'].describe() #descriptive statistics of winstreak of all players
table.loc[table['Winstreak']>0]['Winstreak'].describe() #descriptive statistics of winstreak of players that have a winstreak
Because the winstreak has a huge standard deviation and a lot of extremely large values, we will provide a histogram of the base-2 logarithm of the winstreak for the players who have one.
#create a dataframe for scaled values
lg_ws = pd.DataFrame()
lg_ws['Place'] = table.loc[table['Winstreak']>0]['Place']
lg_ws['Username'] = table.loc[table['Winstreak']>0]['Username']
lg_ws['Log winstreak'] = np.log2(table.loc[table['Winstreak']>0]['Winstreak']).round(3) #scaling by log2
lg_ws.sort_values(by = ['Log winstreak'], ascending = False)[:]
plt.figure(figsize=(12,6)) #creating a histogram
plt.hist(lg_ws['Log winstreak'], 20, density=True, facecolor='g', alpha=0.75)
plt.xlabel('$log_2(Winstreak)$')
plt.ylabel('Probability density') #with density=True the bar heights are densities, not probabilities
plt.title('Figure 4.9 Histogram of winstreak (Log scaled)' , fontsize = 16)
plt.grid(True)
plt.show()
From Figure 4.9 we can see that the winstreak is positively skewed. The player with the longest winstreak goes by the username "Hwamzx".
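The log scaling used for this histogram can be illustrated on a handful of invented winstreak values. The zeros have to be filtered out first, because the logarithm of 0 is undefined.

```python
import numpy as np
import pandas as pd

# Invented winstreak values; zeros stand for players without a streak.
winstreak = pd.Series([0, 1, 2, 8, 0, 64, 1024])

positive = winstreak[winstreak > 0]   # drop zeros before taking the logarithm
log_ws = np.log2(positive)            # each doubling of the streak adds 1

print(log_ws.tolist())
```

On this scale a streak of 1024 is only ten steps away from a streak of 1, which is why the extreme values no longer dominate the histogram.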
lg_ws['KDR'] = kd_wl['Kills/Deaths'] #use the kd_wl dataframe from the KDR and WLR analysis to add columns to lg_ws
lg_ws['WLR'] = kd_wl['Wins/Losses']
lg_ws = lg_ws.dropna(axis=0) #assign the result, otherwise rows with null values are not actually removed
#create rank groups for colors
lg_ws.loc[lg_ws['Place'] <= 25, 'Rank group'] = '1-25'
lg_ws.loc[lg_ws['Place'] > 25, 'Rank group'] = '26-100'
lg_ws.loc[lg_ws['Place'] > 100, 'Rank group'] = '101-250'
lg_ws.loc[lg_ws['Place'] > 250, 'Rank group'] = '251-1000'
fig = px.scatter(lg_ws, y="Log winstreak", x="Place", color="Rank group",
hover_data=['Username'], title = 'Figure 4.10 Log Winstreak/Place')
fig.show()
From Figure 4.10 we can see that there is not much dependence between the place of the player and the winstreak the player gets.
fig = px.scatter(lg_ws, y="Log winstreak", x="KDR", color="Rank group",
hover_data=['Username'], title = 'Figure 4.11 Log Winstreak/KDR')
fig.show()
From Figure 4.11 we can see that the likelihood of getting a high winstreak is greater if a player has a good KDR.
fig = px.scatter(lg_ws, y="Log winstreak", x="WLR", color="Rank group",
hover_data=['Username'], title = 'Figure 4.12 Log Winstreak/WLR')
fig.show()
From Figure 4.12 we can see that the likelihood of getting a high winstreak is also greater if a player has a good WLR, but the effect is weaker than for KDR.
Distribution analysis
Most of the players, even though they are in the top 1000, do not have a high winstreak; however, some players managed to get an extremely high winstreak. The data is clearly positively skewed.
The scatter plots show that there might be some dependence between WLR, KDR and winstreak, but not between winstreak and the place of the player.
Ranking players by average points per game could be good practice, as it shows how efficient each player is in the game.
Before we begin, let's make a separate column for average points per game.
#call the column Pnt_game
#refer back to the points Series we saved for this moment
table['Pnt_game'] = points.div(table['Games']).round(2) #divide points by amount of games
table[['Place', 'Username', 'Pnt_game']].head(10)
#now, lets sort the values
table[['Place', 'Username', 'Pnt_game']].sort_values(by = ['Pnt_game'], ascending = False)
#create rank groups for colors
table.loc[table['Place'] <= 25, 'Rank group'] = '1-25'
table.loc[table['Place'] > 25, 'Rank group'] = '26-100'
table.loc[table['Place'] > 100, 'Rank group'] = '101-250'
table.loc[table['Place'] > 250, 'Rank group'] = '251-1000'
#create boxplot
px.box(table, x = 'Pnt_game', hover_data=['Username', 'Place'],
title = 'Figure 4.13 Box Plot for average points per game')
Figure 4.13 shows two outliers: the lower outlier 'BbyTiger' and the upper outlier 'tastydish'.
We can see that the distribution of average points per game across all players is almost symmetric.
50% of the players get fewer than 230 points per game on average.
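Box plots commonly flag outliers with the 1.5 × IQR rule. The sketch below applies that rule by hand to invented points-per-game values. It assumes linear quantile interpolation, so the exact fences drawn by a plotting library may differ slightly.

```python
import pandas as pd

# Invented points-per-game values with one obviously extreme entry.
pnt_game = pd.Series([200, 210, 220, 230, 240, 250, 260, 600])

q1 = pnt_game.quantile(0.25)
q3 = pnt_game.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are reported as outliers.
outliers = pnt_game[(pnt_game < lower_fence) | (pnt_game > upper_fence)]
print(f"median={pnt_game.median()}, fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```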
#create a boxplots grouped by rank
px.box(table, x = 'Pnt_game', hover_data=['Username', 'Place'], color = 'Rank group',
title = 'Figure 4.14 Box Plot for average points per game, grouped by rank')
Figure 4.14 displays some interesting results. While rank groups '1-25' and '251-1000' are both close to symmetric, rank group '26-100' is positively skewed and rank group '101-250' is negatively skewed.
The only outliers belong to rank group '251-1000' (tastydish and YDMYK), and they are the top 2 players by average points per game, which could imply that these two accounts belong to experienced players who own several game accounts.
As expected, the better the rank group, the more points its players get per game on average. We can see this in the box plot, as each successive box lies further to the right than the previous one.
Now let's rank the players by average points per game.
#using rank() function we will add new column
table['Rank pt/g'] = table['Pnt_game'].rank(ascending = False)
table[['Username', 'Place', 'Rank pt/g']][:]
#creating a scatter plot to observe a correlation between two rankings
fig = px.scatter(table, y="Rank pt/g", x="Place", color="Rank group",
hover_data=['Username'], title = 'Figure 4.15 Scatter plot of two rankings')
fig.show()
#print correlation coefficient between two rankings
print('Correlation coefficient = ' + str(np.round(np.corrcoef(table['Rank pt/g'],table['Place'])[0,1], 3)))
While better rank groups tend to get more points per game, for individual players there is only a weak correlation ($r = 0.313$) between the place by total points and the rank we proposed.
Still, the higher a player's average points per game, the faster that player will reach high ranks. Ranking players by points per game could be a good indicator of future trends in the rankings.
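Because both columns being compared are rankings, the Pearson coefficient computed on them is the Spearman rank correlation of the underlying quantities. The sketch below shows this on an invented five-player leaderboard. The column names follow the notebook, but the values are made up.

```python
import pandas as pd

# Invented leaderboard: 'Place' is the rank by total points,
# 'Pnt_game' is the average number of points per game.
toy = pd.DataFrame({
    "Place": [1, 2, 3, 4, 5],
    "Pnt_game": [300.0, 220.0, 260.0, 180.0, 240.0],
})
toy["Rank pt/g"] = toy["Pnt_game"].rank(ascending=False)

# Pearson correlation of the two rank columns (Series.corr defaults to Pearson)
r_ranks = toy["Place"].corr(toy["Rank pt/g"])

# The same number obtained by ranking both sides explicitly
r_check = toy["Place"].rank().corr(toy["Pnt_game"].rank(ascending=False))
print(round(r_ranks, 3), round(r_check, 3))
```

A rank correlation only measures whether the two orderings agree, so it is not affected by the extreme points-per-game values seen in the box plots.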
Let's begin by creating a dataframe that shows the countries and the number of top 1000 players in each of them.
cntry = pd.DataFrame()
#group by to retrieve number of players in each country, count by usernames
cntry = table[['Country', 'Username']].groupby(['Country']).count()
#move index to column
cntry.reset_index(inplace=True)
#rename to fit the purpose
cntry = cntry.rename(columns={'Username':'Count'})
#sort by count of players
cntry = cntry.sort_values(by = ['Count'], ascending = False)
cntry[:]
So we have data about players from 68 different countries that we know of. The United States has the most top 1000 players in the world.
#summing all of the points of players for each country
cntry2 = table[['Country', 'Points (in thousands)']].groupby(['Country']).sum()
cntry2 = cntry2.sort_values(by = ['Points (in thousands)'], ascending = False)
cntry2.reset_index(inplace=True)
#merging with previous table
cntry = pd.merge(cntry, cntry2, on = 'Country')
cntry[:]
#for the analysis we will use average points per single player of a country
cntry['Points per player (in thousands)'] = cntry['Points (in thousands)'].div(cntry['Count']).round(3)
#also, rank them by points per player
cntry['Rank'] = cntry['Points per player (in thousands)'].rank(ascending = False)
cntry = cntry.sort_values(by = ['Rank'])
cntry[:]
#creating a geo scatter plot that shows the number of players by dot size and the avg points per player by color
fig = px.scatter_geo(cntry, locations="Country", locationmode = 'country names', color="Points per player (in thousands)",
hover_name="Country", size = 'Count', size_max = 50,
projection="natural earth", hover_data = ['Rank', 'Points (in thousands)'],
color_continuous_scale=px.colors.sequential.Bluered,
title = 'Figure 4.16 Scatter world map of countries and the points per player')
fig.show()
Figure 4.16 presents interactive information about how players are distributed around the world. Clearly, most of the player base is located in Europe and North America, but there are also some players from Australia and Russia.
From the table we can see that the Czech Republic leads by average points per player. Israel and Bangladesh take second and third place, separated by a very thin margin (both at roughly 2700 thousand points per player).
The ranking by average points per player is dominated by countries with few players, since their average is spread over fewer players.
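The effect of small countries on this ranking can be illustrated with invented data. The sketch below computes the per-country count and mean in one `groupby` and then applies a minimum player count. The threshold `MIN_PLAYERS` is hypothetical and was not part of the original analysis.

```python
import pandas as pd

# Invented players; country names and point totals are placeholders.
toy = pd.DataFrame({
    "Username": ["a", "b", "c", "d", "e", "f"],
    "Country": ["X", "X", "X", "Y", "Z", "Z"],
    "Points (in thousands)": [900.0, 800.0, 700.0, 2700.0, 1000.0, 600.0],
})

per_country = (
    toy.groupby("Country")
       .agg(Count=("Username", "count"),
            Mean_points=("Points (in thousands)", "mean"))
       .reset_index()
)

MIN_PLAYERS = 2  # hypothetical threshold, not used in the original analysis
robust = per_country[per_country["Count"] >= MIN_PLAYERS]

print(per_country.sort_values("Mean_points", ascending=False))
print(robust.sort_values("Mean_points", ascending=False))
```

In the toy data the single-player country tops the unfiltered ranking and drops out once the threshold is applied, which mirrors the pattern described above.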
Based on the analysis above, we can draw several conclusions about tendencies among the top 1000 Hive Bedwars players.